11 research outputs found
Document classification based on library catalogue metadata
The metadata catalogues of national libraries are good sources of information, since they record nearly everything published in a given period and region. They are usually comprehensively described, so they can serve as sources for quantitative research. In research it is often worthwhile to divide the material into smaller parts, for example by genre. In many cases, however, gaps in the data reduce its usability. This master's thesis assesses the possibility of using machine learning to find research-relevant subsets in library catalogues. As a case study, I chose the English Short Title Catalogue (ESTC), with poetry books as the subset to be found. The genre of poetry books should be annotated, but this information is often missing from actual library catalogues.
I used the random forest algorithm with several types of feature vectors traditionally used in authorship attribution and genre classification, together with the values of metadata fields, to obtain the best result. Because library catalogues do not contain the full text of the books, feature selection focused on the words used in titles and on their linguistic properties. Titles are usually short and contain very little information, so I combined the best-performing features from the feature vectors and ran the final search with them. The main result of the study was confirmation that using titles to construct features is a viable strategy. The study opens up possibilities for defining subsets by machine learning in the future and for increasing the use of library catalogues in quantitative research.
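The title-based feature construction described above might look roughly like the following sketch. It is an illustration only: the abstract does not list the thesis's actual features, so the function names `build_vocabulary` and `title_features`, the choice of word counts, and the two length features are all assumptions. The resulting vectors could then be passed to any random forest implementation.

```python
import re
from collections import Counter


def tokenize(title):
    # Lowercase the title and keep only alphabetic tokens.
    return re.findall(r"[a-z]+", title.lower())


def build_vocabulary(titles, size):
    # Hypothetical helper: pick the `size` words that occur in the most titles,
    # breaking ties alphabetically so the result is deterministic.
    doc_freq = Counter()
    for title in titles:
        doc_freq.update(set(tokenize(title)))
    ranked = sorted(doc_freq.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:size]]


def title_features(title, vocabulary):
    # Assumed feature set: one count per vocabulary word, followed by
    # the number of tokens and the mean token length of the title.
    tokens = tokenize(title)
    counts = Counter(tokens)
    features = [counts[word] for word in vocabulary]
    features.append(len(tokens))
    mean_len = sum(len(t) for t in tokens) / len(tokens) if tokens else 0.0
    features.append(mean_len)
    return features


if __name__ == "__main__":
    sample_titles = [
        "Poems on several occasions",
        "A sermon preached at London",
        "Poems and songs",
    ]
    vocab = build_vocabulary(sample_titles, 2)
    for t in sample_titles:
        print(t, title_features(t, vocab))
```

Short titles give sparse vectors, which is consistent with the abstract's point that titles carry little information and that the best-performing features had to be combined.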
Bibliographic Data Science and the History of the Book (c. 1500–1800)
National bibliographies have been identified as a crucial resource for historical research on the publishing landscape, but using them requires addressing challenges of data quality, completeness, and interpretation. We call this approach bibliographic data science. In this article, we briefly assess the development of book formats and the vernacularization process in early modern Europe. The work undertaken paves the way for more extensive integration of library catalogs to map the history of the book. Peer reviewed
FinnFN 1.0: The Finnish frame semantic database
The article describes the process of creating a Finnish language FrameNet or FinnFN, based on the original English language FrameNet hosted at the International Computer Science Institute in Berkeley, California. We outline the goals and results relating to the FinnFN project and especially to the creation of the FinnFrame corpus. The main aim of the project was to test the universal applicability of frame semantics by annotating real Finnish using the same frames and annotation conventions as in the original Berkeley FrameNet project. From Finnish newspaper corpora, 40,721 sentences were automatically retrieved and manually annotated as example sentences evoking certain frames. This became the FinnFrame corpus. Applying the Berkeley FrameNet annotation conventions to the Finnish language required some modifications due to Finnish morphology, and a convention for annotating individual morphemes within words was introduced for phenomena such as compounding, comparatives and case endings. Various questions about cultural salience across the two languages arose during the project, but problematic situations occurred only in a few examples, which we also discuss in the article. The article shows that, barring a few minor instances, the universality hypothesis of frames is largely confirmed for languages as different as Finnish and English. Peer reviewed
Bibliographic Data Science and the History of the Book (c. 1500–1800)
National bibliographies have been identified as a crucial resource for
historical research on the publishing landscape, but using them requires
addressing challenges of data quality, completeness, and
interpretation. We call this approach bibliographic data science.
In this article, we briefly assess the development of book formats and
the vernacularization process in early modern Europe. The work
undertaken paves the way for more extensive integration of library
catalogs to map the history of the book.
A Quantitative Approach to Book-Printing in Sweden and Finland, 1640–1828
Several cities in Sweden have been providing book-printing facilities
since the 1640s. In our quantitative and explorative analysis of library
catalogs from the National Library of Sweden and the National Library
of Finland we identify the general trends in publishing, how
book-printing has been affected by political events, and how printing
developed at different paces in different parts of the realm. We have
developed a new method for analyzing the totality of publishing through
extensive data harmonization and comprehensive statistical analysis, and
by treating library catalogs not as an endpoint of bibliographic
research but as an inherently rich source of information. This
facilitated the quantitative assessment of printing in the Swedish realm
based on the metadata contained in library catalogs. Our data-driven
approach to the transformation of public discourse demonstrates that
whereas the amount of printed material grew steadily, political ruptures
affected the development of printing. We also suggest that the culture
of books and printing is best understood through the dynamics of
competing intellectual hubs consisting of the university cities and the
political center in Stockholm. This perspective further challenges the
dominant, nationally delineated approach in book history.
A National Public Sphere? Analysing the Language, Location and Form of Newspapers in Finland, 1771–1917
Peer reviewed
Analytical Edition Detection In Bibliographic Metadata
Analytical bibliography aims to understand books and other printed objects as artifacts, including how they were produced. Bibliographic metadata can represent important historical trends and resolve issues such as the ordering of editions. In this paper, we present a state-of-the-art analytical approach for determining editions and their ordering. By providing harmonized data and information on historical developments in book production, this approach will be a great aid for projects aiming to do large-scale text mining. Contemporary text mining approaches do not utilize edition-level information to the fullest extent and are therefore limited in their scope. Using the ESTC metadata, we have developed harmonizing techniques that convert free-form text into more coherent entries for statistical analysis. Furthermore, a new gold standard with multiple layers of information was developed for validation purposes. The use of this data would significantly enhance the understanding of early modern publishing. Peer reviewed
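As a rough illustration of what converting free-form text into coherent entries can involve, the sketch below maps free-text edition statements to integers. It is not the paper's method: the function name `edition_number`, the ordinal table, and the regular expressions are assumptions made for this example, and real ESTC edition statements are far more varied.

```python
import re

# Assumed lookup table for spelled-out ordinals; real data would need more forms.
ORDINALS = {
    "first": 1, "second": 2, "third": 3, "fourth": 4, "fifth": 5,
    "sixth": 6, "seventh": 7, "eighth": 8, "ninth": 9, "tenth": 10,
}


def edition_number(statement):
    """Return the edition number found in a free-text statement, or None."""
    text = statement.lower()
    # Numeric forms such as "4th ed." or "The 12th edition".
    match = re.search(r"(\d+)\s*(?:st|nd|rd|th)?\.?\s*ed", text)
    if match:
        return int(match.group(1))
    # Spelled-out forms such as "The second edition, corrected."
    for word, number in ORDINALS.items():
        if re.search(rf"\b{word}\b", text):
            return number
    # Statements like "A new edition" carry no explicit number.
    return None


if __name__ == "__main__":
    for s in ["The second edition, corrected.", "4th ed.", "A new edition"]:
        print(repr(s), "->", edition_number(s))
```

Once statements are normalized in this way, records of the same work could be ordered by the extracted number, with unnumbered statements left for manual review or other evidence such as publication year.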
Book Formats and Reading Habits in Early Modern Europe
Abstract and poster of paper 0596 presented at the Digital Humanities Conference 2019 (DH2019), Utrecht, the Netherlands, 9-12 July 2019